Computer Vision and Pattern Recognition
♻ ☆ PnLCalib: Sports Field Registration via Points and Lines Optimization
Camera calibration in broadcast sports videos presents numerous challenges
for accurate sports field registration due to multiple camera angles, varying
camera parameters, and frequent occlusions of the field. Traditional
search-based methods depend on initial camera pose estimates and can struggle
with non-standard camera positions and dynamic environments. In response, we
propose an optimization-based calibration pipeline that leverages a 3D soccer
field model and a predefined set of keypoints to overcome these limitations.
Our method also introduces a novel refinement module that improves initial
calibration by using detected field lines in a non-linear optimization process.
This approach outperforms existing techniques in both multi-view and
single-view 3D camera calibration tasks, while maintaining competitive
performance in homography estimation. Extensive experimentation on real-world
soccer datasets, including SoccerNet-Calibration, WorldCup 2014, and
TS-WorldCup, highlights the robustness and accuracy of our method across
diverse broadcast scenarios. Our approach offers significant improvements in
camera calibration precision and reliability.
comment: Extended version of "No Bells, Just Whistles: Sports Field
Registration Leveraging Geometric Properties"
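The refinement module combines point reprojection errors with distances to detected field lines in a single non-linear objective. A hedged toy sketch of that idea — using a planar homography in place of full camera parameters and damped Gauss-Newton with a numerical Jacobian, not the authors' actual implementation — might look like:

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:3]

def residuals(h, pts_model, pts_img, line_pts, lines_img):
    """Stack point reprojection errors with point-to-line distances.
    h: first 8 entries of the homography, H[2, 2] fixed to 1.
    lines_img: one detected image line (a, b, c) per row of line_pts."""
    H = np.append(h, 1.0).reshape(3, 3)
    r_pts = (project(H, pts_model) - pts_img).ravel()
    proj = project(H, line_pts)
    a, b, c = lines_img[:, 0], lines_img[:, 1], lines_img[:, 2]
    r_lines = (a * proj[:, 0] + b * proj[:, 1] + c) / np.hypot(a, b)
    return np.concatenate([r_pts, r_lines])

def refine(H0, *data, iters=20, damping=1e-3):
    """Damped Gauss-Newton refinement with a numerical Jacobian."""
    h = (H0 / H0[2, 2]).ravel()[:8]
    for _ in range(iters):
        r = residuals(h, *data)
        J = np.empty((r.size, 8))
        for j in range(8):
            hj = h.copy()
            hj[j] += 1e-6
            J[:, j] = (residuals(hj, *data) - r) / 1e-6
        h = h - np.linalg.solve(J.T @ J + damping * np.eye(8), J.T @ r)
    return np.append(h, 1.0).reshape(3, 3)
```

The key point is the joint residual vector: keypoint errors anchor the solution while line distances constrain it where keypoints are occluded.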
♻ ☆ ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line Scanning
Detecting unexpected objects (anomalies) in real time has great potential for
monitoring, managing, and protecting the environment. Hyperspectral line-scan
cameras are a low-cost solution that enhances confidence in anomaly detection
over RGB and multispectral imagery. However, existing line-scan algorithms are
too slow when using small computers (e.g. those onboard a drone or small
satellite), do not adapt to changing scenery, or lack robustness against
geometric distortions. This paper introduces the Exponentially moving RX
algorithm (ERX) to address these issues, and compares it with existing RX-based
anomaly detection methods for hyperspectral line scanning. Three large, more
complex datasets (two hyperspectral and one multispectral) are also introduced
to better assess the practical challenges of using line-scan cameras. ERX is
evaluated using a Jetson Xavier NX compute module, achieving the best
combination of speed and detection performance. This research paves the way for
future studies in grouping and locating anomalous objects, adaptive and
automatic threshold selection, and real-time field tests. The datasets and the
Python code are available at: https://github.com/WiseGamgee/HyperAD.
comment: 17 pages, 13 figures, 4 tables, code and datasets accessible at
https://github.com/WiseGamgee/HyperAD
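The core idea — an RX (Reed-Xiaoli) detector whose background mean and covariance are updated with an exponential moving average as each line arrives — can be sketched as follows. This is a minimal numpy illustration, not the optimized reference implementation in the linked repository:

```python
import numpy as np

def erx_scores(lines, alpha=0.05, eps=1e-6):
    """Score each pixel of a stream of hyperspectral lines with an
    RX-style Mahalanobis distance, using exponentially moving estimates
    of the background mean and covariance."""
    mean, cov, all_scores = None, None, []
    for line in lines:                      # line: (n_pixels, n_bands)
        centred = line - line.mean(axis=0)
        line_cov = centred.T @ centred / max(len(line) - 1, 1)
        if mean is None:                    # initialise from the first line
            mean, cov = line.mean(axis=0), line_cov
        else:                               # exponential moving update
            mean = (1 - alpha) * mean + alpha * line.mean(axis=0)
            cov = (1 - alpha) * cov + alpha * line_cov
        # Mahalanobis distance of each pixel from the moving background
        inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
        d = line - mean
        all_scores.append(np.einsum('ij,jk,ik->i', d, inv, d))
    return np.array(all_scores)
```

The moving average is what lets the detector adapt to changing scenery: old background statistics decay at rate `alpha` per line instead of accumulating forever.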
♻ ☆ RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment
Zutao Jiang, Guian Fang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai Liao, Xiaojun Chang, Xiaodan Liang
Recent advances in text-to-image diffusion models have achieved remarkable
success in generating high-quality, realistic images from textual descriptions.
However, these approaches have faced challenges in precisely aligning the
generated visual content with the textual concepts described in the prompts. In
this paper, we propose a two-stage coarse-to-fine semantic re-alignment method,
named RealignDiff, aimed at improving the alignment between text and images in
text-to-image diffusion models. In the coarse semantic re-alignment phase, a
novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the
semantic discrepancy between the generated image caption and the given text
prompt. Subsequently, the fine semantic re-alignment stage employs a local
dense caption generation module and a re-weighting attention modulation module
to refine the previously generated images from a local semantic view.
Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the
proposed two-stage coarse-to-fine semantic re-alignment method outperforms
other baseline re-alignment techniques by a substantial margin in both visual
quality and semantic similarity with the input prompt.
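The coarse-stage caption reward scores how well a caption of the generated image matches the prompt. As a hedged stand-in for the BLIP-2-based similarity (which requires the full model), a bag-of-words cosine similarity conveys the shape of the computation:

```python
from collections import Counter
import math

def caption_reward(caption: str, prompt: str) -> float:
    """Toy caption reward: cosine similarity between bag-of-words vectors
    of the generated image's caption and the text prompt. A simplified
    stand-in for the learned BLIP-2 similarity used in RealignDiff."""
    a = Counter(caption.lower().split())
    b = Counter(prompt.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A higher reward indicates smaller semantic discrepancy between what the image depicts (as captioned) and what the prompt asked for.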
♻ ☆ ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning ACCV 2024
In this paper, we introduce ARBEx, a novel attentive feature extraction
framework driven by a Vision Transformer with reliability balancing to cope
with poor class distributions, bias, and uncertainty in the facial expression
learning (FEL) task. We combine several data pre-processing and refinement
methods with a window-based cross-attention ViT to get the most out of the
data. We also employ learnable anchor points in the embedding space
with label distributions and a multi-head self-attention mechanism to optimize
performance against weak predictions via reliability balancing, a
strategy that leverages anchor points, attention scores, and confidence values
to enhance the resilience of label predictions. To ensure correct label
classification and improve the model's discriminative power, we introduce
anchor loss, which encourages large margins between anchor points.
Additionally, the multi-head self-attention mechanism, which is also trainable,
plays an integral role in identifying accurate labels. This approach provides
critical elements for improving the reliability of predictions and has a
substantial positive effect on final prediction capabilities. Our adaptive
model can be integrated with any deep neural network to forestall challenges in
various recognition tasks. Our strategy outperforms current state-of-the-art
methodologies, according to extensive experiments conducted in a variety of
contexts.
comment: Extended version is accepted in ACCV 2024 as GReFEL
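The anchor loss encourages large margins between the learnable anchor points. One plausible minimal form — a pairwise hinge on inter-anchor distances; the paper's exact formulation may differ — can be sketched as:

```python
import numpy as np

def anchor_loss(anchors, margin=1.0):
    """Hinge-style loss that penalises any pair of anchor points whose
    embedding-space distance falls below `margin`, encouraging the
    anchors to spread out (a sketch of the anchor-loss idea in ARBEx)."""
    diff = anchors[:, None, :] - anchors[None, :, :]   # (k, k, dim)
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)        # pairwise distances
    mask = ~np.eye(len(anchors), dtype=bool)           # skip self-pairs
    return float(np.maximum(margin - dist[mask], 0.0).mean())
```

Well-separated anchors incur zero loss; collapsed anchors incur a loss close to the margin, pushing the class representatives apart during training.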
♻ ☆ ReCAP: Recursive Cross Attention Network for Pseudo-Label Generation in Robotic Surgical Skill Assessment
In surgical skill assessment, the Objective Structured Assessments of
Technical Skills (OSATS) and Global Rating Scale (GRS) are well-established
tools for evaluating surgeons during training. These metrics, along with
performance feedback, help surgeons improve and reach practice standards.
Recent research on the open-source JIGSAWS dataset, which includes both GRS and
OSATS labels, has focused on regressing GRS scores from kinematic data, video,
or their combination. However, we argue that regressing GRS alone is limiting,
as it aggregates OSATS scores and overlooks clinically meaningful variations
during a surgical trial. To address this, we developed a recurrent transformer
model that tracks a surgeon's performance throughout a session by mapping
hidden states to the six OSATS scores, derived from kinematic data, using a clinically
motivated objective function. These OSATS scores are averaged to predict GRS,
allowing us to compare our model's performance against state-of-the-art (SOTA)
methods. We report Spearman's Correlation Coefficients (SCC) demonstrating that
our model outperforms SOTA methods using kinematic data (SCC 0.83-0.88), and
matches the performance of video-based models. It also surpasses SOTA in most
tasks for average OSATS predictions (SCC 0.46-0.70) and specific OSATS (SCC
0.56-0.95). The generation of pseudo-labels at the segment level translates
quantitative predictions into qualitative feedback, vital for automated
surgical skill assessment pipelines. A senior surgeon validated our model's
outputs, agreeing with 77% of the weakly-supervised predictions (p=0.006).
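Spearman's correlation coefficient, the evaluation metric reported above, is simply the Pearson correlation of the ranks. A minimal sketch (no tie correction, which is adequate for continuous predicted scores):

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Double argsort turns values into 0-based ranks (no tie averaging)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because it works on ranks, the metric rewards models whose predicted OSATS/GRS scores order surgical trials correctly, regardless of the absolute score scale.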